Computer Science > Computation and Language

arXiv:2406.20094v1 (cs)
[Submitted on 28 Jun 2024 (this version), latest version 8 May 2025 (v3)]

Title: Scaling Synthetic Data Creation with 1,000,000,000 Personas

Authors: Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
Abstract: We propose a novel persona-driven data synthesis methodology that leverages the varied perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs, and tools (functions) at scale, we demonstrate that persona-driven data synthesis is versatile, scalable, flexible, and easy to use. It can potentially drive a paradigm shift in synthetic data creation and its practical applications, which may have a profound impact on LLM research and development.
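
To make the core idea concrete: persona-driven synthesis amounts to conditioning each generation request on a distinct persona description, so that the same task prompt yields diverse outputs. The sketch below is an illustrative reconstruction from the abstract only, not the authors' released code; the prompt template, persona strings, and the `complete` callable are all assumptions standing in for whatever pipeline and LLM backend one actually uses.

    # Minimal sketch of persona-driven data synthesis, as described in the
    # abstract. Personas, template, and the LLM interface are illustrative
    # assumptions, not the paper's implementation.
    from typing import Callable, Iterable

    PROMPT_TEMPLATE = (
        "You are the following persona:\n{persona}\n\n"
        "From this persona's perspective, write one challenging math word "
        "problem, then give a step-by-step solution."
    )

    def synthesize(personas: Iterable[str],
                   complete: Callable[[str], str]) -> list[dict]:
        """Generate one synthetic (persona, output) record per persona.

        `complete` is any function that sends a prompt to an LLM and
        returns its text response (e.g. a thin wrapper around a
        chat-completions API).
        """
        records = []
        for persona in personas:
            prompt = PROMPT_TEMPLATE.format(persona=persona)
            records.append({"persona": persona, "output": complete(prompt)})
        return records

    if __name__ == "__main__":
        demo_personas = [
            "a chemical kinetics researcher modeling reaction rates",
            "a moving-company driver planning multi-stop routes",
        ]
        # Echo stub in place of a real LLM call, so the sketch runs offline.
        stub = lambda p: f"[LLM response to: {p[:40]}...]"
        for rec in synthesize(demo_personas, complete=stub):
            print(rec["persona"], "->", rec["output"])

Swapping the task sentence in the template (e.g. "write a user instruction" or "define a tool/function") reproduces, in miniature, the other use cases the abstract lists; the diversity comes from varying the persona rather than the task.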
Comments: Work in progress
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2406.20094 [cs.CL]
  (or arXiv:2406.20094v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2406.20094
arXiv-issued DOI via DataCite

Submission history

From: Xin Chan
[v1] Fri, 28 Jun 2024 17:59:01 UTC (2,583 KB)
[v2] Tue, 24 Sep 2024 00:38:10 UTC (2,583 KB)
[v3] Thu, 8 May 2025 00:24:02 UTC (2,583 KB)